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ABSTRACT 

We introduce a fluid dynamics algorithm that performs with nearly spectral accu- 
racy, but uses finite-differences instead of FFTs to compute gradients and thus executes 
10 times faster. The finite differencing is not based on a high-order polynomial fit. The 
polynomial scheme has supurb accuracy for low-wavenumber gradients but fails at high 
wavenumbers. We instead use a scheme tuned to enhance high-wavenumber accuracy 
at the expense of low wavenumbers, although the loss of low-wavenumber accuracy is 
negligibly slight. A tuned gradient is capable of capturing all wavenumbers up to 80 
percent of the Nyquist limit with an error of no worse than 1 percent. The fact that 
gradients are based on finite differences enables diverse geometries to be considered and 
eliminates the parallel communications bottleneck. 



1. Introduction 

The spectral fluid algorithm (Canuto et. al. 1987) uses Fast Fourier Transforms (FFTs) 
to compute gradients, the most precise means possible. Finite-difference gradients based on a 
polynomial fit execute faster than FFTs but with less accuracy, necessitating more grid zones to 
achieve the same resolution as the spectral method. The loss of accuracy outweighs the gain in 
speed and the spectral method has more resolving power than the finite-difference method. We 
introduce an alternative finite-difference formula not based on a polynomial fit that executes as 
quickly but with improved accuracy, yielding greater resolving power than the spectral method. 

In section 2 we derive high-wavenumber finite difference formulas and exhibit their effect on 
the resolving power of a turbulence simulation in section 3. In section 4 we apply finite differences 
for the purpose of mimicking the spectral algorithm, and then proceed to other applications in 
section 5. 



2. Finite differences 

Define a function fj{xj) on a set of grid points Xj = j with j an integer. Then construct a 
gradient /'(O) at x = from sampling a stencil of grid points with radius (or order) S on each side. 
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The familiar result for the gradient on a radius-1 stencil is /'(O) ~ (/i — /_i)/2, which is obtained 
from fitting a polynomial of degree 2 to fj. For a degree 4 polynomial on a radius-2 stencil, 

/'(0)~^/-2-^/-l + ^/l-^/2. 

For a stencil of order S, 

s 

no) ~ Yl ^jf^ (1) 

where M_j = -M,-. 

Consider the finite-difference error at x=0 for a Fourier mode sin(7rfex). Cosine modes can be 
ignored because they don't contribute to the derivative at a; = 0. Note that the wavenumber k is 
scaled to grid units so that k = 1 corresponds to the maximum wavenumber expressible on the grid, 
also known as the Nyquist limit. The finite difference formula (1) gives /^(O) ~ 2 J2^=i sin{jk). 
Whereas the correct value should be kir, define an error function 

s 

Es{k) = fcTT - 2 ^ Mj sin(7rA;j). 
j=i 

Figure 1 shows Es{k) for stencils of radius 1 through 24. The first order finite difference formula 
is quite lame, delivering 1 percent accuracy only for k less than 0.f2. Brandenburg (2001) recog- 
nized that higher-order finite differences can significantly extend the wavenumber resolution of the 
polynomial-based finite difference. The 8th order finite difference accuracy is better than 1 percent 
up to A; = .56 and the 16th order finite difference up to A; = .66. Nevertheless these are still far from 
the Nyquist limit of A; = 1 and even higher-order finite-differences yield little further progress. 

A Fourier transform gives the correct gradient for all k up to unity. This is why the spectral 
algorithm delivers the most resolution per grid element. The resolution limit is set by the maximum 
k for which gradients can be captured accurately. Although the 8th order finite-difference formula 
involves considerably fewer floating point operations than an FFT, the loss of resolution still renders 
it less efficient than the spectral method. 

Polynomial-based finite differences have high accuracy at low k but fail at large k. We can 
instead construct a more practical scheme to improve high-fc accuracy at the expense of low-A; 
accuracy, yet the loss of low-A; accuracy is negligibly small. Prom equation 2, we see that the 
problem of computing accurate gradients reduces to optimizing, or "tuning" the coefficients Mj 
to minimize the error over a suitable range of k, or equivalently to construct a sine series that 
best mimics a linear function. A set of tuned coefficients appear in table 2 with the associated 
error functions in figure 1. The error in the radius-8 tuned finite difference is less than 1 percent 
up to A; = .80, a dramatic improvement over the radius-8 polynomial. An algorithm based on 
tuned gradients still has a lower maximum A; than the spectral algorithm but due to its increased 
speed it has greater resolving power (section 3). Henceforth we denote these tuned gradients as 
"hypergr adients . " 
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2.1. Tuning the hypergradient coefficients 

Minimizing the error function involves a multiparameter optimization of the coefficients Mj, - 
a problem-dependent task with multiple legitimate options. In fact, a high degree of customization 
is possible for the form of the error function. For this application we proceed as follows. Define a 
target kmax an indicator for the quality of the tuned coefficients: 

4 



E 



dk 



tt/c - 2 ^ Mj sin(7rjA;) 



Then, perform a multi-dimensional optimization on Mj. The use of a fourth power ensures an 
approximately uniform error within < A; < kmax, although a weight function could be added to 
further customize the form of the error function, kmax is then adjusted until the error is 1 percent. 
The procedure is repeated for each order S to yield the coefficients in table 2. 

It is worth noting that the radius-8 tuned coefficients are similar to the radius-24 polynomial 
coefficients. This is not surprising because the polynomial coefficients are too small to matter 
outside of radius 8. 



Stencil radius 
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16 
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Polynomial 
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.24 


.34 


.40 


.44 


.48 


.56 


.66 


.72 


Hypergradient 


.20 


.38 


.54 


.64 


.70 


.74 


.80 


.84 


.92 



Table 1: Maximum resolved wavenumber K for the polynomial and hypergradient finite 
ences shown in figure 1. The tolerance in the relative error is 1 percent. 
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Table 2: The finite- difference coefficients Mj. The label "P{S)" corresponds to the polynomial- 
based finite difference on a radius S stencil. 'T{S)" denotes the tuned hypergradient finite difference 
coefficients engineered to have a relative error of 1 percent. The entry labeled "Interp" is for the 
tuned coefficients used in interpolating halfway between two grid points. The entry labeled "Fade" 
gives the coefficients for the Fade derivative in explicit finite- difference form. 
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Table 3: Finite- difference coefficients for large stencil radii, with the same notation as for table 
2. The label 'FFT" denotes the coefficients for a Fourier transform (section 2.2). 



The case p' = \ involves a cyclic tridiagonal matrix, p' = 2 a cyclic pentadiagonal matrix, etc. 
The Fade scheme is not compact because the gradient evaluation draws information from all grid 
points, propagated around the grid by the cyclic matrix. The contribution from each grid point 
can be explicitly evaluated from the matrix inverse to express it in the form of equation 1. The 6th 
order tridiagonal scheme with F[ = 1/3, Pi = 7/9, P2 = 1/36 yields the coefficients denoted "Fade" 
in tabic 3, where the entries Mj have significant magnitudes up to j = 6, in spite of the fact that 
Pj extends only to j = 2. The Fade scheme has a maximum wavenumber ol K = .50 whereas the 
radius-6 tuned scheme has if = .74. The tuned scheme is maximally compact in the sense that it 



2.2. Compactness 



The Fade scheme evaluates implicit derivatives: 




(2) 
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Fig. 1. — Figure 1: The relative gradient error for the polynomial-based finite difference (dotted 
line) and tuned finite difference (solid line). Numbers indicate the stencil radius. 
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has the best wavenumber-accuracy properties if the information to be drawn from is confined to 
within a given stencil radius. 

The spectral gradient is particularly noncompact. The coefficients for a size N transform 
expressed in the form of equation 1 are 

^^■= A^E^«i^(2^^-?V^) (iV^oo). (3) 

s=l 

It is interesting to note that these are the coefficients for the hnear function expressed as an infinite 
sine series. Unfortunately, quite a few terms are needed before the series starts to resemble the 
linear function. 



2.3. Higher-order gradients and other operations 

Diffusion involves second-order gradients. For these, we utilize the cosine modes to construct 
an error function analogous to equation (2): 

s 

Ef^ {k) = {kirf + Mo + 2 ^ MjCos{Trkj). 

For fourth-order gradients, 

s 

E^g\k) = -{kTrf + Mo + 2 ^ MjCos{iTkj). 

i=i 

A low pass filter can mimic a high-order hyperdiffusivity. To construct this, replace the (vr/c)^ term 
with something that is zero at low k and large at high k with a sharp transition at the desired 
cutoff k. 

An interpolation halfway between two grid points is useful for resolution refinement and dealias- 

ing: 

s 

E^p{k) = -TT + 2^MjCos{Trk{j - 1/2)). (4) 
Here, i = 1 corresponds to the nearest point, j = 2 to the next most distant point, etc. 



3. Resolving power of a turbulence simulation 

We address two elements of computational efficiency - execution speed and resolution per grid 
element, which together constitute the resolving power. Other factors exist that won't be considered 
here, such as Lagrangian vs. Eulerian timesteps, and variable timesteps. 
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Define the execution speed as the number of grid zones (or particles, for an SPH code) processed 
per CPU GigaFlop per second, or Kilo-Elements per GFlop per second (Kegs). Also define a 
resolution per grid element (K) as a measure of the effective resolution of a grid element or particle: 

^ _ kMax 
kNyquist 

where kuax is the maximum wavenumbcr for which accurate gradients can be computed and 
kpfyquist is the Nyquist wavenumber, the maximum wavenumber expressible on the grid, k^yquist = 
tt/A for a grid spacing of A. The value of K for various finite difference schemes appears in table 
1. 

The resolving power (R) is the measure of the speed of a code at a fixed benchmark resolution. 
For example, a code with high K requires fewer grid cells and hence executes faster than a code 
with lower K. The scaling for a 3D algorithm based on explicit time differences of second-or-higher 
order is 

R = [Kegs] • i^l 
For a flux-transport algorithm, it is i? = [Kegs] • K'^. 

Many algorithms such as the Riemann shock solver aren't based on explicit gradients and lack 
an easily definable K. K can alternatively be defined in terms of the maximum achievable Reynolds 
number Re on an grid: Re = {ANK)*/^. Maron & Cowley (2001) found that a 128^ spectral 
simulation with a, K = 2/?> dealiasing truncation had Re = 2500 implying A ~ 4.1. For algorithms 
with intrinsic numerical viscosity plus an explicit Laplacian viscosity, the value of the viscosity 
parameter can be varied to identify the minimum meaningful viscosity and from that an effective 
Reynolds number follows. 



3.1. Transforms 

A 3D spectral code is based on real-to-complex and complex-to-real Fast Fourier Transforms 
(FFTs) on an N'^ grid. This transform requires 7.5 log2(A'^) floating point operations per grid 
point. The FFT is difficult to optimize and executes at only some fraction of the peak floating point 
speed. For the purposes of wallclock execution time, we may think of the FFT as having effectively 
7.5log2{N)/ fpFT operations per grid point where fpFT is the efficiency factor associated with 
algorithm. We further identify separate efficiency factors for the serial and parallel aspects of the 
algorithm by Fppj, and Fppj, respectively. The FFT poses a challenge to optimization because the 
ID stage of the transform involves highly irregular memory access (and hence memory latency), the 
3D stage requires non-consecutive array access over the slow indices, and the parallel stage requires 
communication between every processor pair. Frigo and Johnson (1998) met this challenge in grand 
style with their cross-platform application "Fastest Fourier Transform in the West (FFTW)," which 
is usually the best performer on any given architecture. On the Pittsburgh Supercomputing Center's 
Alpha ES45 "Lemieux" it does especially well, with fppj, = 0.32 and fppj. = 0.38 for a 1024^ 
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transform on 512 processors. For a 512^ grid on 256 processors, /pprp = 0.32 and fppj' = 0.44. 
Other large-grid, large-CPU transforms perform similarly (figure 2). We take these parameters as 
the practical limit of FFT efficiency. 
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Fig. 2. — Figure 2: The performance of FFTW for 256^, 512^, and 1024^ transforms on the 
Pittsburgh Supercomputing Center's MPI-parallel machine "Lemieux." The efficiency factor is the 
fraction of the peak floating point speed. Note that the parallel inefficiency takes effect at a low 
number of CPUs and plateaus at a high number of CPUs. 
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3.2. The finite-difference gradient transform 

The finite-difference gradient transform (equation 1) on a radius S stencil lias a floating point 
operation count per grid element of (45' — 2). Unlike the FFT, memory access is linear and parallel 
communition occurs only between the two adjacent processors. The transform can be optimized 

so that the serial and parallel efficiency factors /J.^, and f^'^ arc much closer to unity than for the 
FFT case. On the Pittsburgh Supercomputing Center's Alpha ES45 "Lemieux", /|i£, = 0.85 and 
fpD ~ ^-^ yield an overall finite-difference efficiency factor oi fpD = 0.68. 

3.3. Pade gradients 

The Pade scheme is nonlocal - the transform has to be computed simultaneously on a ID cut 
that transverses the volume. The operation count is 5 adds and multiplies plus one divide for the 
tridiagonal scheme, and 7 adds and multiplies plus one divide for the pentadiagonal scheme (Lele 
1992). On the HP ES45, a divide takes 12 clocks, and since it can't be pipelined the effective cost 
is 24 floating point operations. This is because ordinarily the ES45 produces one add and one 
multiply result per clock cycle. The total number of floating point operations is then 34/fpade for 
the tridiagonal scheme and 38/fpade for the pentadiagonal scheme, where fpade is the efficiency 
factor of the algorithm. 

The fact that the Pade scheme is nonlocal eliminates the possibility of adaptive resolution 
through a variable stencil size, and also degrades the performance at boundaries. The fact that 
it's a matrix operation makes it more difficult to pipeline efficiently whereas the tuned finite- 
difference gradient is a straightforward local convolution. The locality of the tuned gradient involves 
communication only between adjacent processors whereas the Pade method is nonlocal and requires 
an all-to-all communications stage. This seriously degrades the Pade scheme's parallel scalability, 
especially for large numbers of CPUs (section 3.7). 

3.4. Operation count for one timestep 

The floating point operation count per timestep is the operation count per transform times the 
number of transforms. For the transform count, assume a Runge-Kutta 2nd order timestep (RK2) 
so that the field gradients are evaluated twice per timestep. A spectral hydro simulation involves 
3 transforms from Fourier to real space and 6 transforms back to Fourier space, all done twice for 
a total of 18 transforms. The MHD case has 30 transforms. 

Aliasing error limits the spectral resolution of a bilinear partial differential equation to K = 

2/3, or K = \/8/9 = .94 with the inclusion of a phase shift correction (Camito ct. al. 1987). 
The correction is implemented by calculating the time derivatives on two grids - the original grid 
and a grid shifted to the zone center - and averaging the results after shifting back to the original 
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grid. The procedure can be coordinated with the Runge-Kutta algorithm by calculating the two 
Runga-Kutta stages on grids shifted diagonally from each other by half a grid zone (Canuto et. al. 
1987). The grid shift adds negligible extra computation for the spectral algorithm because it can 
be carried out in Fourier space as a mere multiply. For a finite-difference code it costs two real- 
space interpolations per dynamical field. The interpolation can be tuned for enhanced wavenumber 
accuracy analogously with finite-difference gradients. The radius-4 hypergradient is accurate up to 
K = .64 and needs no grid-shift aliasing correction whereas higher-order hypergradients do. 

To estimate the operation count of the finite-difference method we assume the adiabatic equa- 
tions of MHD, although other forms are possible. 

dtV = - V ■ VV - p-^VP + p-\V xB)xB + uV^Y + v^V^Y (5) 





dtB = -V X 


(V X B) + TyV^B + 774 V^B 




dtp = -V • (pV) 


p ~ pT V • B = 
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Velocity field 


B Magnetic field 


V 


Viscosity 


7] Magnetic resistivity 




4th order hyperviscosity 


774 4th order magnetic hyperresistivity 


P 


Fluid Pressure 


p Density 


7 


Adiabatic index 





For hydrodynamics (without a magnetic field) there are 9 velocity and 3 density gradients. A 
magnetic field adds 9 magnetic gradients. Diffusion terms don't add to the computation because 
they only need to be applied once every few timcstcps (section 3.5). Each gradient is evaluated 
twice per timestep for Runge-Kutta order 2. The grid shifts are evaluated twice per field per 
timestep for stencil radii larger than 4. These considerations determine the number of transforms 
in table 4. Note that for the finite difference and Fade cases, each coordinate direction is counted 
as one transform. For the FFT case, the 3D transform is counted as one transform. 
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Table 4-' The speed, resolution, and resolving power for spectral and finite-difference algorithms. 
The value of K = .94 for the spectral algorithm arises from the y/8/9 rule for the 3D staggered- grid 
dealiasing procedure (Canuto et. al. 1987). Execution speed is estimated from the transform time 
while other overhead is ignored. The finite- difference efficiency factor is approximately fFD ^ -7 

Assuming that the finite-difference efficiency factor fFD = fsfp can be optimized to a better 
degree than for the FFT transform, the tuned gradient method could potentially be up to 7 times 
faster than the spectral method. The most efficient hypergradient configuration is with a stencil 
radius of 4 because larger stencil radii require an aliasing correction. More resolution could be 
achieved with a loss of efficiency by using larger-radius stencils. 

3.5. Diffusion 

Diffusion serves two roles: the removal of energy cascading to the inner scale, and as a dealiasing 
filter. If diffusion is the only agent acting in the Navier-Stokes equation, the Fourier component V 
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evolves as dtY = —uk'^Y. Defining A:< to be the limiting resolution, the inner-scale modes evolve 
with a timescale of t< ~ The diffusion v is generally set so that t< corresponds to the 

dynamical cascade timescale so that diffusion balances energy cascading to the inner scale. This 
is also known as the Lagrangian timescale. The timestep, however, is proportional to the Eulerian 
timescale, the time for the fluid to flow a distance equal to the resolution scale. Generally, the 
Lagrangian timescale greatly exceeds the Eulerian timescale and so diffusion has only a small effect 
each timestep. We can therefore economize by applying kinetic and magnetic diffusion only once 
every few timesteps with a corresponding increase in the values of and rj. This effectively removes 
diffusion from the computational load. In section 4 we find that once every four timesteps yields 
equivalent results as once every timestep. Furthemore, the coefficients of the second and fourth 
order diffusion operators as well as those for a dealiasing filter can be added to compactify them 
into one operator. 

3.6. Magnetic divergence 

Magnetic divergence can be suppressed either with a scheme that preserves it exactly, such 
as the constrained-transport algorithm (Stone & Norman, 1992 I, 1992 II), or by letting it evolve 
unconstrained and periodically repairing it such as with an FFT. The constrained-transport scheme 
derives electric field interpolations and gradients from first first-order finite differences and hence 
has poor wavenumber resolution. High-wavenumber finite differences improve this situation and in 
fact magnetic divergence grows slowly enough so as to not be concern (section 4.) 

3.7. Parallelization 

Good parallel scalability occurs if the demand for communications is less than that for float- 
ing point arithmetic so that both can occur simultaneously, otherwise communications hinder ex- 
ecution. The relevant indicator is the ratio R of floating point operations per second divided 
by the transmission rate of floating point variables between processors. For a machine such as 
the Pittsburgh Supercomputing Center's 3000 processor HP Alpha ES45 "Lemieux," the ratio is 
RArch = 2GFlops / (68 • lO^reals/s) - 30 (San Diego Supercomputing Center study, 2003). CPU 
technology tends to improve more rapidly than communications and so this ratio may increase over 
the next few years (table 5). 

A finite-difference transform on a radius S stencil invokes 2S' — 1 adds and S multiplies, 
effectively totaling 45" — 2 operations since add and multiply units appear in pairs. Summed over 
the three coordinate directions this is 12S — 6 floating point operations. For parallel execution, 
let the data be distributed in slabs of dimension [N, N, N') where the grid size is N'^ and the 
number of CPUs is C = N/N' . To compute the z gradient transform every processor summons and 
sends 2N'^S reals to and from adjacent processors. The two-way MPI pass proceeds at double the 
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one-way rate quoted in table 3.9. The ratio of floating point operations to passed variables is then 
Rfd = N'{6 — 3/S). Efficient parallel scalability can occur if RpD > RArch or N' > 4. 

For the Fade transform, the entire ID coordinate slice has to be on the same processor. For 
slab geometry, the x and y transforms can be done with the data on hand and then a global 
interprocessor transpose reorganizes the data along z for the z transform. A final transpose returns 
to the original slab to calculate the time derivatives. The ratio of fioating point operations to passed 
variables is Rpade = 3F/2 where F is the number of floating point operations per grid element in 
the ID transform. With F = 34 for the radius-2 tridiagonal configuration and F = 38 for the 
radius-3 pentadiagonal configuration (table 4) it's at the threshold of efficient parallelizability. 

The spectral method consists of ID x, y and z transforms to convert from Fourier space to real 
space and back. Only one transpose per z transform is necessary because the real-space products 
can be conducted in the transposed state. Therefore the ratio of floating point operations to passed 
variables is Rfft = fs where fpp^^ ~ -32 is the efficiency factor for the serial stage of the 
computation (section 3.1). In spite of the fact that Rfft is approximately 8 times RArch ^oi the 
supercomputer "Lemieux", the parallel efficiency is only f^prp = .38, implying that the all-to-all 
transpose is substantially slower than the peak transmission rate would suggest. Since the Fade 
transform has about 10 times fewer floating point operations than the spectral transform, it is likely 
that the Fade scheme would be seriously inefficient in parallel execution. Since the finite-difference 
transform only involves passing between the adjacent CFUs, it does not suffer from this problem. 



3.8. Memory 

A code based on tuned finite differences can calculate the entire time update for small sections 
of the simulation volume at a time, eliminating the need for large temporary storage so that only 
the dynamical fields {Y,'B,p} count toward the memory requirement. This is 7 arrays of size 
(AT, N, N') where N' is N divided by the number of CFUs. It is safe to assume that the temporary 
storage plus the memory taken by the operating system will exceed the size of one array. Given that 
computer memory tends to come in powers of 2, we assume that the total memory requirement per 
CPU is equal to the size of 16 arrays of 4 Byte reals. Table 6 shows how grids less than or equal to 
1024^ can be fit easily onto most supercomputers and a 2048^ grid could fit onto a supercomputer 
with 512 CFUs and 1 GB/CFU. N' should be at least 4 for good parallel scalability (section 3.7), 
and so 2048^ simulations are feasible on most existing supercomputers (table 5). 

A code based on Fade gradients cannot be arranged to compute the time update for subsections 
successively as could be done for tuned finite differences. This is because the Fade transform is 
nonlocal whereas the tuned finite difference operates on a local stencil only. Hence, the entire 
grid has to be updated all at once, demanding temporary storage for each partial derivative and 
expanding the memory requirement beyond 16 arrays per CFU. This is also true for a spectral 
code. 
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3.9. Supercomputers 



Machine 


Name 


Clock 


GFlops RAM CPUs 


MPI 


Pipe 


LI 


L2 


L3 RAM 






(GHz) 




(GB) 




speed 


stages 


(kB) (MB) (MB) speed 














(GB/s) 








(GB/s) 


NCSA Xeon 


Tungsten 


3.1 


6.2 


1.5 


2560 


.5 






0.5 




ORNL Cray XI 


Phoenix 


0.8 


12.8 


4.0 


512 


5 




16 


2.0 


200 


PSC HP ES45 


Lemieux 


1.0 


2.0 


1.0 


3016 


.25 


4 


64 


8.0 


5.2 


ORNL IBM Power4 Cheetah 


1.3 


5.2 


1.0 


864 


.25 


6 


32 


1.5 


32 200 


UIUC Intel Itanium 


Titan 


0.8 


3.2 


2.0 


256 


.1 










UIUC Intel PentiumPlatimim 


1.0 


1.0 


1.5 


968 


.01 


5 








Japan NEC 


Earth Sim 


0.5 


8.0 


2.0 


5120 












Iowa Athlon 


Zephyr 


1.5 


1.0 


1.0 


32 


.25 


4 








Athlon Opteron 




1.6 


3.2 


1.0 








64 


1 


5.3 


SGI Altrix 




1.3 


2.6 










32 


.25 


3 6.4 



Table 5: Academic supercomputers. For Cheetah, some CPUs have 4 CB/CPU hut most have 
1. The French Alpha ES45 "Ixia" and "Nympea" have the same characteristics as Lemieux. Most 
of these numbers are from Dunigan 2003. The two-way MPI passing speed is double the tabulated 
one-way speed. Some entries are missing because this kind of information is often hard to come by 
on the web. 

11 12 13 



Grid 


CPUs 


Memory/CPU 






(GB) 


512^ 


64 


1/8 


512=* 


128 


1/16 


1024=* 


128 


1/2 


1024^ 


256 


1/4 


2048^ 


128 


4 


2048-'* 


256 


2 


2048^ 


512 


1 



Table 6: Parallel configurations and memory requirements for a finite- difference code. 

4. Imitating an incompressible spectral MHD code 

It is possible to mimic the function of an incompressible spectral code with high-wavenumber 
finite differences, and with significantly more resolving power (table 4). The principal advantage of 
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simulating incompressible turbulence is that the sound speed does not impact the Courant timestep 
condition. We start with the equations of MHD (equations 5 through 7) but with p set to unity and 
the pressure term deleted. In its place we add terms of the form i//)V(V • V) and r7/)V(V • B) to 
the fluid and magnetic equations respectively to suppress the growth of divergence in the velocity 
and magnetic fields, and %]£) are artificial divergence diffusivities that will be discussed in section 
4.1. 



4.1. Divergence correction 

An artificial term of the form dtV = z^d VV ■ V in the Navier Stokes equation removes diver- 
gence while preserving the solenoidal component. Define Vy = k- V/A; and Vj^ = V — Vy, and then 
the artificial term has the effect of 5tV|| = — i^/jAj^Vy and 3tV_|_ = 0. It's not surprising that the 
procedure is most effective for large k modes. Fully removing the divergence for all modes requires 
a global operation such as can be accomplished with an FFT. This procedure is instead based on 
locally calculated gradients, un is set so that the divergence in the largest k modes is completely 
eliminated in one timestep: 

_ 1 _ 1 

where N is the number of grid points in each dimension and k^yquist is the Nyquist wavenumber. 
A larger value overcompensates for the largest k modes. This procedure serves to suppress the 
growth of divergence for several timesteps, but not indefinitely. Ultimately a complete correction 
involving FFTs must be applied, especially for low k modes. Figure 3 shows that divergence can be 
held to less than 1 percent of the maximum by applying the FFT divergence corrector once every 
8 timesteps. At this frequency, the FFT contributes negligibly to the computational load. 

Define an energy spectrum: < V"^ > /2 = J Ey{k)dk, where Ey{k) = 27rA;^ < |y(A;)p > . 
Also define a divergence spectrum Dv{k) = 27r/c^ < |k • V(A;)|^ > and a divergence normalization 
spectrum Dv{k) = 27rA:^ < k'^\V{k)'\^ > . The divergence normalization serves to define a fractional 
divergence Dy{k)/ Dy{k). Analogously define a magnetic spectrum EB{k), a magnetic divergence 
spectrum Dsik), and a magnetic divergence normalization spectrum DB{k). 

4.2. Diffusivity 

Neither the diffusivity operator nor the spectral divergence correction need to occur every 
timestep (section 3.5). Nearly equivalent evolution results from applying them only once every few 
timesteps, lessening their impact on the compuational load. Let the diffusivity be applied once 
every Idiff timesteps with v and r] replaced by i^Idiff and rjldiff, with similar replacements for the 
higher order diffusivities. Also, let the spectral divergence correction be applied once every Ifft 
timesteps. 
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4.3. Simulations 

With the need to study both kinetic and magnetic divergence, we ran tests with fuUy turbulent 
kinetic and magnetic fields. Initial conditions come from the magnetic dynamo simulations of Maron 
& Cowley (2001). There, forced isotropic turbulence with a magnetic field is evolved to its long-term 
saturated state where the magnetic energy equals the kinetic energy. This is the magnetic analogue 
to the Kolmogorov cascade. We restarted this simulation without forcing to study the divergence 
growth rate. Table 7 lists the simulations that were run for this study. Z677 is a high-resolution 
spectral simulation that serves as a benchmark for comparison. This simulation was run with the 
spectral code "Tulku" (Maron & Goldreich 2001). 

Simulations Z749 through Z754 (table 7) were run to optimize the values of the divergence 
diffusion paramters vd and rjD- No spectral divergence correction was applied (Ifft = oo), leaving 
the divergence suppression solely to vd and rjo. The initial divergence spectrum is identically zero. 
The divergence spectrum after 4 timesteps is plotted in figure 3. Increasing vjj and rjo decreases the 
divergence spectrum until i^d = VD = -016, whereupon further increase results in an overshoot in 
the correction at large k. Taking the value of 0.016 as optimal, we plot the growth of the divergence 
spectrum with this value (simulation Z753) in figure 3. After 4 timesteps, the ratio of the divergence 
to the divergence normalization spectrum is at most 0.015 for V and 10~^ for B. 

In figure 5 we plot the evolution of the finite difference simulation Z748 together with the 
benchmark spectral simulation Z677, both starting from the same initial conditions. After 0.5 




1 10 102 1 10 102 



k/(27T) k/(2TT) 

Fig. 3. — Figure 3: Left: The V and B diveryence spectra for simulation Z753. Numbers indicate 
timesteps. Right: The divergence spectra for simulations Z749 through Z754 after four timesteps of 
evolution, for a sequence of parameters ud and rjo- 
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dynamical times, the spectra (figure 5) and real space fields (figure 4) are in good agreement. 
The spectrum for simulation Z748 is slightly diminished from Z677 because the divergence takes 
with it a small measure of energy. Simulation Z748 has 1^ = I^j = 4 and IpFT = 4. Another 
simulation (Z757) has I^, = Irj = 1 and Ifft = 4 and has exactly the same energy spectrum as 
Z748, establishing that the diffusion operator doesn't have to be applied every timestep. 



Simulal ion 


Grid 


Algorit lini 




' lU 




'/ 


Ifft 

1 1 1 


mil 


Spectral 


128^ 








1 


1 


1 


Z745 


Hypergradient 


64^ 


.016 


.016 


1 


1 


1 


Z746 


Hypcrgradient 


643 


.016 


.016 


2 


2 


2 


Z747 


Hypergradient 


64^ 


.016 


.016 


3 


3 


3 


Z748 


Hypergradient 


64^ 


.016 


.016 


4 


4 


4 


Z749 


Hypergradient 


64^ 


.000 


.000 


4 


4 


oo 


Z750 


Hypergradient 


64^ 


.004 


.004 


4 


4 


oo 


Z751 


Hypergradient 


643 


.008 


.008 


4 


4 


oo 


Z752 


Hypergradient 


643 


.012 


.012 


4 


4 


oo 


Z753 


Hypergradient 


643 


.016 


.016 


4 


4 


oo 


Z754 


Hypergradient 


64^ 


.020 


.020 


4 


4 


oo 


Z756 


Hypergradient 


64^ 


.016 


.016 


4 


4 


1 


Z757 


Hypergradient 


64^ 


.016 


.016 


1 


1 


4 



Table 7: Index of simulations. All simulations have V = T] = 10-3, = = 2.5 • 10-8, 
and At = 0.003, except for Z677, which has At = 0.0012. Ifft = co indicates that the FFT 
divergence correction is never applied. All finite- difference simulations utilize a radius-8 stencil 
and a phase-shift dealiasing correction. 



5. Applications: Interpolation, refinement, and data analysis 

The most accurate and most expensive interpolation procedure is drect evaluation of the 
Fourier series. Tuned finite differences provide a less expensive interpolation high-wavenumber 
interpolation. For example, in 2D, the centered interpolation (equation 4) provides function values 
halfway between grid points, and another interpolation along the diagonal yields the values at the 
grid centers. We have thus doubled the resolution of the grid, which we can do again if we wish. 
Note that we can do this for the entire grid or just for a subsection. After a few doublings, a 
simple linear interpolation serves to provide the function value at any point in space, yielding a 
2D interpolation with the same wavenumber resolution as the component ID interpolation. This 
procedure generalizes to arbitrary dimension. 

As if it wasn't enough trouble to run large simulations on thousands of cpus, one is next 
confronted with analyzing large data cubes that are too big to fit in the memory of one machine. 
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Tuned operators allow for the construction of local local alternatives to global functions like the 
FFT. These include derivatives, dealiasing, filtering, and grid interpolation. Large output files can 
be stored in small easy-to-managc segments and operated on successively. For the purpose of data 
analysis, we have provided radius-16 tuned operators in table 3 that are accurate to 0.3 percent. 
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Fig. 5. — Figure 5: The spectra for simulations Z748 and Z677, described in section 4-3. k is the 
wavenumber and L is the cube side length. 
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